Sentiment Analysis Research with the HathiTrust Digital Library¶

Contents¶

  1. Overview
  2. Acquiring the Textual Data
  3. Installing Dependencies
  4. Fiction Example: The Count of Monte Cristo
    1. Full Text Analysis
    2. Extracted Features Analysis
    3. Emotional Valence Graph
  5. Nonfiction Example: The Origin of Species
    1. Full Text Analysis
    2. Extracted Features Analysis
    3. Emotional Valence Graph
  6. Exploring Large Language Models (LLMs)
  7. Conclusion
  8. Further Readings

Overview¶

In this project, I conducted sentiment analysis using Python on two volumes from the HathiTrust Digital Library to examine how the emotional valence changes across each text. The book titles I analyzed were one fiction novel The Count of Monte Cristo by Alexandre Dumas and one nonfiction text The Origin of Species by Charles Darwin. My goal was to generate visualizations of the change in emotional valence over the span of these books.

I utilized two forms of textual data for my analysis: full text (TXT) files downloaded directly from HathiTrust Digital Library, as well as Extracted Features (EF) obtained from HathiTrust Research Center (HTRC) Analytics. HTRC Analytics enables non-profit research and educational uses of materials in the HathiTrust collection, including those still under copyright. Specifically, the Extracted Features contain metadata about volumes and pages alongside part-of-speech-tagged tokens and token counts extracted from full texts.

With this textual data, I performed sentiment analysis using three tools: VADER, TextBlob, and AFINN. Each tool assigns sentiment scores to input texts which can be aggregated and visualized to show how emotional valence shifts across the span of pages in each book.

Acquiring the Textual Data¶

To download the full-text files for entire volumes from HathiTrust, one needs to be affiliated with a HathiTrust member institution and logged into the HathiTrust website using institutional credentials.

Full text files can be directly downloaded from each item's page. Extracted features must be accessed through HTRC Analytics, following this EF download tutorial.

For this project, I downloaded the following data files:

The Count of Monte Cristo: mdp-39015062136661-1693964099.txt (full text) and mdp.39015062136661.json.bz2 (Extracted Features)

On the Origin of Species: hvd-hw39sc-1696432701.txt (full text) and hvd.hw39sc.json.bz2 (Extracted Features)

Installing Dependencies¶

In this project, I have selected three widely-used sentiment analysis tools:

  1. VADER: Part of the NLTK (Natural Language Toolkit), VADER is specifically designed for social media texts, adept at handling informal language, emojis, and slang.
  2. TextBlob: A user-friendly library, TextBlob simplifies many common natural language processing (NLP) tasks, including sentiment analysis.
  3. AFINN: AFINN is a wordlist-based tool for sentiment analysis where each word in the list is rated for its sentiment strength.

These tools can be installed in a Python environment using pip commands:

pip install nltk
pip install textblob
pip install afinn

To analyze Extracted Features, htrc-feature-reader is needed. It is a tool designed specifically to work with Extracted Features from HTRC.

pip install htrc-feature-reader

Other modules needed for the project include pandas and plotly. The first tool is used for data analysis, and the second is for plotting interactive graphs.

pip install pandas
pip install plotly==5.18.0

Fiction Example: The Count of Monte Cristo¶

Full Text Analysis¶

With the texts downloaded and the required dependencies installed, I was ready to begin the sentiment analysis process. The first step was importing the necessary libraries and modules:

In [1]:
# Import libraries for data analysis and visualization
import re  # for regular expression operations
import pandas as pd
import plotly.graph_objects as go

# Import the sentiment analysis tools
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
from afinn import Afinn

The full-text file contained the entire book of The Count of Monte Cristo without any page breaks. To analyze the text, I needed to structure it into a more manageable format. I aimed to turn the text into a DataFrame with two columns: one for page numbers and the other for the content on each page.

While looking at the TXT file, I noticed markers that indicate page breaks, formatted as ## p. (#1) #################################################. I decided to use regular expressions (RegEx) to identify these markers and parse the text accordingly.

In [2]:
# Open the file and split the text into lines
with open('mdp-39015062136661-1693964099.txt', 'r', encoding='utf-8') as file:
    text = file.read()
lines = text.split('\n')

# Initialize lists for page numbers and content
page_numbers = []
page_content = []

current_page_number = None
current_page_content = []

# Parse the text with RegEx
for line in lines:
    if line.startswith("## p. "):
        if current_page_number is not None:
            page_numbers.append(current_page_number)
            page_content.append(" ".join(current_page_content))

        page_pattern = r'## p\. (\d+)'
        match = re.match(page_pattern, line)
        if match:
            current_page_number = match.group(1)
            current_page_content = []
    else:
        current_page_content.append(line)

Next, I created a DataFrame named dumas_full_text to organize page_numbers and page_content.

In [3]:
dumas_full_text = pd.DataFrame({
    'page_number': [int(x) for x in page_numbers],
    'page_content': page_content,
})

Here is a preview of the DataFrame for pages 11 to 15:

In [4]:
dumas_full_text[10:15]
Out[4]:
page_number page_content
10 11 THE COUNT OF MONTE CRISTO 11 navy—a costume s...
11 12 12 THE COUNT OF MONTE CRISTO 1 piness is like...
12 13 THE COUNT OF MONTE CRISTO 13 elder Dantès, wh...
13 14 14 THE COUNT OF MONTE CRISTO 1 1 ought to do,...
14 15 THE COUNT OF MONTE CRISTO 15 “You understand ...
VADER¶

With the DataFrame ready, I proceeded with sentiment analysis using the VADER tool:

In [5]:
analyzer = SentimentIntensityAnalyzer()

dumas_full_text['vader_sentiment_score'] = 0.0
dumas_full_text['vader_sentiment'] = ""

for tuple in dumas_full_text.itertuples():
    sentence = tuple.page_content

    sentiment_dictionary = analyzer.polarity_scores(sentence)
    compound = sentiment_dictionary['compound']

    dumas_full_text.at[tuple.Index, 'vader_sentiment_score'] = compound

    if compound >= 0.33:
        vader_sentiment = "Positive"
    elif compound <= -0.33:
        vader_sentiment = "Negative"
    else:
        vader_sentiment = "Neutral"

    dumas_full_text.at[tuple.Index, 'vader_sentiment'] = vader_sentiment

dumas_full_text[10:15]
Out[5]:
page_number page_content vader_sentiment_score vader_sentiment
10 11 THE COUNT OF MONTE CRISTO 11 navy—a costume s... 0.9972 Positive
11 12 12 THE COUNT OF MONTE CRISTO 1 piness is like... 0.8093 Positive
12 13 THE COUNT OF MONTE CRISTO 13 elder Dantès, wh... 0.8625 Positive
13 14 14 THE COUNT OF MONTE CRISTO 1 1 ought to do,... -0.5018 Negative
14 15 THE COUNT OF MONTE CRISTO 15 “You understand ... 0.9674 Positive
TextBlob¶

The process for sentiment analysis with TextBlob is similar to that with VADER:

In [6]:
dumas_full_text['textblob_sentiment_score'] = 0.0
dumas_full_text['textblob_sentiment'] = ""

for tuple in dumas_full_text.itertuples():
    sentence = tuple.page_content

    classifier = TextBlob(sentence)
    polarity = classifier.sentiment.polarity

    dumas_full_text.at[tuple.Index, 'textblob_sentiment_score'] = polarity

    if polarity >= 0.1:
        textblob_sentiment = "Positive"
    elif polarity <= -0.1:
        textblob_sentiment = "Negative"
    else:
        textblob_sentiment = "Neutral"

    dumas_full_text.at[tuple.Index, 'textblob_sentiment'] = textblob_sentiment

dumas_full_text[10:15]
Out[6]:
page_number page_content vader_sentiment_score vader_sentiment textblob_sentiment_score textblob_sentiment
10 11 THE COUNT OF MONTE CRISTO 11 navy—a costume s... 0.9972 Positive 0.239049 Positive
11 12 12 THE COUNT OF MONTE CRISTO 1 piness is like... 0.8093 Positive 0.108443 Positive
12 13 THE COUNT OF MONTE CRISTO 13 elder Dantès, wh... 0.8625 Positive 0.090585 Neutral
13 14 14 THE COUNT OF MONTE CRISTO 1 1 ought to do,... -0.5018 Negative 0.090830 Neutral
14 15 THE COUNT OF MONTE CRISTO 15 “You understand ... 0.9674 Positive 0.052443 Neutral
AFINN¶

Lastly, I used AFINN to analyze the sentiment across the full text of The Count of Monte Cristo.

In [7]:
afinn = Afinn(language='en')

dumas_full_text['afinn_sentiment_score'] = 0.0

for tuple in dumas_full_text.itertuples():
    sentence = tuple.page_content

    score = afinn.score(sentence)

    dumas_full_text.at[tuple.Index, 'afinn_sentiment_score'] = score

dumas_full_text[10:15]
Out[7]:
page_number page_content vader_sentiment_score vader_sentiment textblob_sentiment_score textblob_sentiment afinn_sentiment_score
10 11 THE COUNT OF MONTE CRISTO 11 navy—a costume s... 0.9972 Positive 0.239049 Positive 38.0
11 12 12 THE COUNT OF MONTE CRISTO 1 piness is like... 0.8093 Positive 0.108443 Positive 12.0
12 13 THE COUNT OF MONTE CRISTO 13 elder Dantès, wh... 0.8625 Positive 0.090585 Neutral 0.0
13 14 14 THE COUNT OF MONTE CRISTO 1 1 ought to do,... -0.5018 Negative 0.090830 Neutral -16.0
14 15 THE COUNT OF MONTE CRISTO 15 “You understand ... 0.9674 Positive 0.052443 Neutral 0.0

One thing to note is that the scale of the afinn_sentiment_score is different from the scales of vader_sentiment_score and textblob_sentiment_score. While VADER and TextBlob's range is between -1 and 1, AFINN's scores are sums of sentiment values of individual words, which resulted in much larger absolute values. Therefore, I needed to normalize AFINN to the range of -1 to 1.

In [8]:
# Normalize the AFINN sentiment scores
min_value = min(dumas_full_text['afinn_sentiment_score'])
max_value = max(dumas_full_text['afinn_sentiment_score'])

normalized_numbers = [(x - min_value) / (max_value - min_value) for x in dumas_full_text['afinn_sentiment_score']]

# Adjust the normalized numbers to the -1 to 1 range
afinn_normalized = [2 * x - 1 for x in normalized_numbers]

dumas_full_text['afinn_normalized'] = afinn_normalized

dumas_full_text[10:15]
Out[8]:
page_number page_content vader_sentiment_score vader_sentiment textblob_sentiment_score textblob_sentiment afinn_sentiment_score afinn_normalized
10 11 THE COUNT OF MONTE CRISTO 11 navy—a costume s... 0.9972 Positive 0.239049 Positive 38.0 0.525926
11 12 12 THE COUNT OF MONTE CRISTO 1 piness is like... 0.8093 Positive 0.108443 Positive 12.0 0.140741
12 13 THE COUNT OF MONTE CRISTO 13 elder Dantès, wh... 0.8625 Positive 0.090585 Neutral 0.0 -0.037037
13 14 14 THE COUNT OF MONTE CRISTO 1 1 ought to do,... -0.5018 Negative 0.090830 Neutral -16.0 -0.274074
14 15 THE COUNT OF MONTE CRISTO 15 “You understand ... 0.9674 Positive 0.052443 Neutral 0.0 -0.037037

Extracted Features Analysis¶

Moving on to analyzing the Extracted Features, I first imported FeatureReader from the htrc_features library.

In [9]:
from htrc_features import FeatureReader
import warnings
# The warnings are suppressed to avoid clutter in the output and do not affect the program's functionality
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
In [10]:
paths = ['mdp.39015062136661.json.bz2']
fr = FeatureReader(paths)
vol = next(fr.volumes())

Then, I created a DataFrame for Extracted Features named dumas_ef. I also grouped the tokens by page number, so that I could analyze the content on each page in a way similar to full text.

In [11]:
dumas_ef = vol.tokenlist(pos=False, case=False)\
        .reset_index().drop(['section'], axis=1)
dumas_ef.columns = ['Page Number', 'token', 'count']
In [12]:
# Group tokens by page number
grouped_tokens = dumas_ef.groupby('Page Number')

The next section of the code combines the sentiment analysis of the EF with the three tools: VADER, TextBlob and AFINN.

In [13]:
# Initialize lists to store sentiment analysis results
ef_vader_sentiment_score = []
ef_textblob_sentiment_score = []
ef_afinn_sentiment_score = []

# Perform sentiment analysis for each page
for name, group in grouped_tokens:
    page_text = " ".join([word * count for word, count in zip(group['token'], group['count'])])

    # VADER Analysis
    sentiment_scores = analyzer.polarity_scores(page_text)
    ef_vader_sentiment_score.append(sentiment_scores['compound'])

    # TextBlob Analysis
    sentiment = TextBlob(page_text).sentiment
    ef_textblob_sentiment_score.append(sentiment.polarity)

    # AFINN Analysis
    sentiment_score = afinn.score(page_text)
    ef_afinn_sentiment_score.append(sentiment_score)

# Create a DataFrame with all sentiment analysis results
dumas_ef = pd.DataFrame({
    'page_number': [int(x) for x in grouped_tokens.groups.keys()],
    'ef_vader_sentiment_score': ef_vader_sentiment_score,
    'ef_textblob_sentiment_score': ef_textblob_sentiment_score,
    'ef_afinn_sentiment_score': ef_afinn_sentiment_score
})

dumas_ef[10:15]
Out[13]:
page_number ef_vader_sentiment_score ef_textblob_sentiment_score ef_afinn_sentiment_score
10 18 0.9524 0.155742 0.0
11 19 0.8484 0.156593 2.0
12 20 0.6795 0.133160 4.0
13 21 0.8987 0.108465 10.0
14 22 0.9511 0.008393 12.0

Again, I needed to normalize the AFINN scores.

In [14]:
min_value = min(dumas_ef['ef_afinn_sentiment_score'])
max_value = max(dumas_ef['ef_afinn_sentiment_score'])

normalized_numbers = [(x - min_value) / (max_value - min_value) for x in dumas_ef['ef_afinn_sentiment_score']]
ef_afinn_normalized = [2 * x - 1 for x in normalized_numbers]

dumas_ef['ef_afinn_normalized'] = ef_afinn_normalized

dumas_ef[['ef_afinn_sentiment_score', 'ef_afinn_normalized']][10:15]
Out[14]:
ef_afinn_sentiment_score ef_afinn_normalized
10 0.0 -0.075269
11 2.0 -0.032258
12 4.0 0.010753
13 10.0 0.139785
14 12.0 0.182796

Another thing that needed to be adjusted is the page numbers in the dumas_ef DataFrame. The original page numbers in dumas_ef were higher because they included pages from the book's front matter, such as the cover and copyright pages. By subtracting 16, the number of front matter of this book, I ensured that the page numbers in dumas_ef match those in dumas_full_text for meaningful comparison.

In [15]:
FRONT_MATTER_PAGES = 16
dumas_ef['page_number'] = dumas_ef['page_number'].astype(int) - FRONT_MATTER_PAGES
In [16]:
dumas_ef[10:15]
Out[16]:
page_number ef_vader_sentiment_score ef_textblob_sentiment_score ef_afinn_sentiment_score ef_afinn_normalized
10 2 0.9524 0.155742 0.0 -0.075269
11 3 0.8484 0.156593 2.0 -0.032258
12 4 0.6795 0.133160 4.0 0.010753
13 5 0.8987 0.108465 10.0 0.139785
14 6 0.9511 0.008393 12.0 0.182796

Emotional Valence Graph¶

My goal is to visualize the overall emotional trend throughout the book. For this, a detailed granularity is not necessary. Therefore, I used a rolling mean with a window size of 20 pages to smooth out the data.

In [17]:
WINDOW_SIZE = 20

# Columns in dumas_full_text for rolling window operation
columns_full_text = ['vader_sentiment_score', 'textblob_sentiment_score', 'afinn_normalized']

for col in columns_full_text:
    dumas_full_text[col] = dumas_full_text[col].rolling(window=WINDOW_SIZE, min_periods=1).mean()

# Columns in dumas_ef for rolling window operation
columns_ef = ['ef_vader_sentiment_score', 'ef_textblob_sentiment_score', 'ef_afinn_normalized']

for col in columns_ef:
    dumas_ef[col] = dumas_ef[col].rolling(window=WINDOW_SIZE, min_periods=1).mean()

Finally, I used plotly to create an interactive graph with all the processed data. The interactive graph enables the viewer to toggle which graph(s) they would like to see and allows for a more straightforward comparison.

In [18]:
fig = go.Figure()

# Plotting for dumas_full_text DataFrame
sentiment_scores_full_text = {
    'vader_sentiment_score': 'Full Text VADER',
    'textblob_sentiment_score': 'Full Text TextBlob',
    'afinn_normalized': 'Full Text AFINN Normalized'
}

for column, label in sentiment_scores_full_text.items():
    fig.add_trace(go.Scatter(x=dumas_full_text['page_number'], y=dumas_full_text[column], mode='lines', name=label))

# Plotting for dumas_ef DataFrame
sentiment_scores_ef = {
    'ef_vader_sentiment_score': 'EF VADER',
    'ef_textblob_sentiment_score': 'EF TextBlob',
    'ef_afinn_normalized': 'EF AFINN Normalized'
}

for column, label in sentiment_scores_ef.items():
    fig.add_trace(go.Scatter(x=dumas_ef['page_number'], y=dumas_ef[column], mode='lines', name=label))

# Additional plot settings
fig.update_layout(
    title=f'Emotional Valence throughout "The Count of Monte Cristo"',
    xaxis_title='Page Number',
    yaxis_title='Sentiment Score',
    showlegend=True
)

fig.show()

Nonfiction Example: The Origin of Species¶

Full Text Analysis¶

Moving on to the nonfiction example: The Origin of Species. The process was largely the same, with only a few minor differences.

In this book's full-text file, the page numbering resets after page 400, marking the start of Part 2. To ensure a continuous page count throughout the book, I implemented an offset that allows the numbering to continue past 400 instead of restarting.

In [19]:
with open('hvd-hw39sc-1696432701.txt', 'r', encoding='utf-8') as file:
    text = file.read()
lines = text.split('\n')

page_numbers = []
page_content = []

current_page_number = None
current_page_content = []
offset = 0  # Initialize an offset

for line in lines:
    if line.startswith("## p. "):
        if current_page_number is not None:
            page_numbers.append(current_page_number)
            page_content.append(" ".join(current_page_content))

        page_pattern = r'## p\. (\d+)'
        match = re.match(page_pattern, line)
        if match:
            # Check for page number reset
            if int(match.group(1)) == 1 and current_page_number is not None:
                offset += current_page_number  # Update the offset with the last page number

            current_page_number = int(match.group(1)) + offset
            current_page_content = []
    else:
        current_page_content.append(line)

As with the fiction example, I created a DataFrame for the full text named darwin_full_text.

In [20]:
darwin_full_text = pd.DataFrame({
    'page_number': [int(x) for x in page_numbers],
    'page_content': page_content,
})
darwin_full_text[10:15]
Out[20]:
page_number page_content
10 11 A HISTORICAL SKETCH OF THE PROGRESS OF OPINIO...
11 12 12 HISTORICAL SKETCH times has treated it in ...
12 13 HISTORICAL SKETCH 13 he maintains that such f...
13 14 14 HISTORICAL SKETCH tinctly recognizes the p...
14 15 HISTORICAL SKETCH 15 The Hon. and Rev. W. Her...
VADER¶
In [21]:
analyzer = SentimentIntensityAnalyzer()

darwin_full_text['vader_sentiment_score'] = 0.0
darwin_full_text['vader_sentiment'] = ""

for tuple in darwin_full_text.itertuples():
    sentence = tuple.page_content

    sentiment_dictionary = analyzer.polarity_scores(sentence)
    compound = sentiment_dictionary['compound']

    darwin_full_text.at[tuple.Index, 'vader_sentiment_score'] = compound

    if compound >= 0.33:
        vader_sentiment = "Positive"
    elif compound <= -0.33:
        vader_sentiment = "Negative"
    else:
        vader_sentiment = "Neutral"

    darwin_full_text.at[tuple.Index, 'vader_sentiment'] = vader_sentiment
TextBlob¶
In [22]:
darwin_full_text['textblob_sentiment_score'] = 0.0
darwin_full_text['textblob_sentiment'] = ""

for tuple in darwin_full_text.itertuples():
    sentence = tuple.page_content

    classifier = TextBlob(sentence)
    polarity = classifier.sentiment.polarity

    darwin_full_text.at[tuple.Index, 'textblob_sentiment_score'] = polarity

    if polarity >= 0.1:
        textblob_sentiment = "Positive"
    elif polarity <= -0.1:
        textblob_sentiment = "Negative"
    else:
        textblob_sentiment = "Neutral"

    darwin_full_text.at[tuple.Index, 'textblob_sentiment'] = textblob_sentiment
AFINN¶
In [23]:
afinn = Afinn(language='en')

darwin_full_text['afinn_sentiment_score'] = 0.0

for tuple in darwin_full_text.itertuples():
    sentence = tuple.page_content

    score = afinn.score(sentence)

    darwin_full_text.at[tuple.Index, 'afinn_sentiment_score'] = score

# Normalize AFINN scores
min_value = min(darwin_full_text['afinn_sentiment_score'])
max_value = max(darwin_full_text['afinn_sentiment_score'])

normalized_numbers = [(x - min_value) / (max_value - min_value) for x in darwin_full_text['afinn_sentiment_score']]

afinn_normalized = [2 * x - 1 for x in normalized_numbers]

darwin_full_text['afinn_normalized'] = afinn_normalized

darwin_full_text[10:15]
Out[23]:
page_number page_content vader_sentiment_score vader_sentiment textblob_sentiment_score textblob_sentiment afinn_sentiment_score afinn_normalized
10 11 A HISTORICAL SKETCH OF THE PROGRESS OF OPINIO... 0.9172 Positive 0.092385 Neutral 12.0 0.062500
11 12 12 HISTORICAL SKETCH times has treated it in ... 0.9807 Positive 0.151984 Positive 20.0 0.229167
12 13 HISTORICAL SKETCH 13 he maintains that such f... 0.5499 Positive 0.043086 Neutral 1.0 -0.166667
13 14 14 HISTORICAL SKETCH tinctly recognizes the p... 0.9829 Positive 0.157964 Positive 7.0 -0.041667
14 15 HISTORICAL SKETCH 15 The Hon. and Rev. W. Her... 0.8620 Positive 0.028126 Neutral 7.0 -0.041667

Extracted Features Analysis¶

In [24]:
paths = ['hvd.hw39sc.json.bz2']
fr = FeatureReader(paths)
vol = next(fr.volumes())
In [25]:
darwin_ef = vol.tokenlist(pos=False, case=False) \
    .reset_index().drop(['section'], axis=1)
darwin_ef.columns = ['Page Number', 'token', 'count']
grouped_tokens = darwin_ef.groupby('Page Number')
In [26]:
ef_vader_sentiment_score = []
ef_textblob_sentiment_score = []
ef_afinn_sentiment_score = []

for name, group in grouped_tokens:
    page_text = " ".join([word * count for word, count in zip(group['token'], group['count'])])

    # VADER Analysis
    sentiment_scores = analyzer.polarity_scores(page_text)
    ef_vader_sentiment_score.append(sentiment_scores['compound'])

    # TextBlob Analysis
    sentiment = TextBlob(page_text).sentiment
    ef_textblob_sentiment_score.append(sentiment.polarity)

    # AFINN Analysis
    sentiment_score = afinn.score(page_text)
    ef_afinn_sentiment_score.append(sentiment_score)

# Create a DataFrame with all sentiment analysis results
darwin_ef = pd.DataFrame({
    'page_number': [int(x) for x in grouped_tokens.groups.keys()],
    'ef_vader_sentiment_score': ef_vader_sentiment_score,
    'ef_textblob_sentiment_score': ef_textblob_sentiment_score,
    'ef_afinn_sentiment_score': ef_afinn_sentiment_score
})
In [27]:
# Normalize AFINN scores for EF
min_value = min(darwin_ef['ef_afinn_sentiment_score'])
max_value = max(darwin_ef['ef_afinn_sentiment_score'])

normalized_numbers = [(x - min_value) / (max_value - min_value) for x in darwin_ef['ef_afinn_sentiment_score']]

ef_afinn_normalized = [2 * x - 1 for x in normalized_numbers]

darwin_ef['ef_afinn_normalized'] = ef_afinn_normalized

darwin_ef[10:15]
Out[27]:
page_number ef_vader_sentiment_score ef_textblob_sentiment_score ef_afinn_sentiment_score ef_afinn_normalized
10 15 0.7574 -0.021173 5.0 -0.106383
11 16 0.8779 0.082906 -1.0 -0.361702
12 17 0.8497 0.121780 4.0 -0.148936
13 18 0.9545 0.183201 18.0 0.446809
14 19 0.0516 0.127521 1.0 -0.276596
In [28]:
# Subtract the number of pages in front matter from EF DataFrame
FRONT_MATTER_PAGES = 16
darwin_ef['page_number'] = darwin_ef['page_number'].astype(int) - FRONT_MATTER_PAGES

Emotional Valence Graph¶

In [29]:
# Smoothing the graph with rolling mean
WINDOW_SIZE = 20

columns_full_text = ['vader_sentiment_score', 'textblob_sentiment_score', 'afinn_normalized']

for col in columns_full_text:
    darwin_full_text[col] = darwin_full_text[col].rolling(window=WINDOW_SIZE, min_periods=1).mean()

columns_ef = ['ef_vader_sentiment_score', 'ef_textblob_sentiment_score', 'ef_afinn_normalized']

for col in columns_ef:
    darwin_ef[col] = darwin_ef[col].rolling(window=WINDOW_SIZE, min_periods=1).mean()
In [30]:
fig = go.Figure()

# Plotting for dumas_full_text DataFrame
sentiment_scores_full_text = {
    'vader_sentiment_score': 'Full Text VADER',
    'textblob_sentiment_score': 'Full Text TextBlob',
    'afinn_normalized': 'Full Text AFINN Normalized'
}

for column, label in sentiment_scores_full_text.items():
    fig.add_trace(go.Scatter(x=darwin_full_text['page_number'], y=darwin_full_text[column], mode='lines', name=label))

# Plotting for dumas_ef DataFrame
sentiment_scores_ef = {
    'ef_vader_sentiment_score': 'EF VADER',
    'ef_textblob_sentiment_score': 'EF TextBlob',
    'ef_afinn_normalized': 'EF AFINN Normalized'
}

for column, label in sentiment_scores_ef.items():
    fig.add_trace(go.Scatter(x=darwin_ef['page_number'], y=darwin_ef[column], mode='lines', name=label))

# Additional plot settings
fig.update_layout(
    title=f'Emotional Valence throughout "The Origin of Species"',
    xaxis_title='Page Number',
    yaxis_title='Sentiment Score',
    showlegend=True
)

fig.show()

Exploring Large Language Models (LLMs)¶

During my project, I experimented with sentiment analysis using Large Language Models (LLMs) such as BERTweet and SiEBERT. However, the approach encountered several challenges. For the full text analysis, the length of content on each page often exceeded the token limit of these models. I attempted to segment the page content into individual sentences using spaCy, but occasionally, even these sentences were too lengthy. Truncating sentences to fit within the token limit compromised the accuracy of the analysis.

In the case of Extracted Features, the use of LLMs proved impractical. The token lists comprised isolated words without context, which contradicts the advantage of LLMs: analyzing more extended sentences or texts to understand the overall sentiment. Additionally, the computational demands of running LLMs exceeded the capabilities of my available hardware.

Given these limitations and the challenges, I ultimately decided against including LLMs in the final iteration of my project.

Conclusion¶

In analyzing the graphs for both the fiction and nonfiction examples, several key observations emerged:

  • The disparity in sentiment scores between the full text and Extracted Features using the same tool was relatively minor. In contrast, there were more significant variations when comparing different tools analyzing the same input.
  • TextBlob's sentiment scores generally hovered closer to neutral (0), while VADER's scores showed a greater deviation from neutrality.
  • As for the overall sentiment direction, VADER typically presented more positive scores. TextBlob's scores were moderately positive, while AFINN's scores leaned more towards the negative spectrum.

These findings highlight the distinct characteristics and tendencies of each sentiment analysis tool when applied to both fiction and nonfiction works.

In closing, I would like to express my gratitude to Glen Layne-Worthey for his invaluable guidance and encouragement throughout this project. I am also deeply thankful to Ryan Dubnicek for sharing his technical expertise, which greatly aided this work.

Further Readings¶

Bowers, Katherine and Quinn Dombrowski. “Katia and the Sentiment Snobs”. The Data-Sitters Club. October 25, 2021. https://datasittersclub.github.io/site/dsc11.html.

Organisciak, Peter and Boris Capitanu. "Text Mining in Python through the HTRC Feature Reader." Programming Historian 5 (2016). https://doi.org/10.46430/phen0058.